This analysis was initiated by Urška Sršen, co-founder of Bellabeat. Sršen knows that an analysis of available data on consumers’ smart device usage would reveal opportunities for the company to grow, and has provided the following business task:
Analyse smart device usage data to gain insight into how people are already using smart devices, then generate high-level recommendations for how these insights can inform the marketing strategy for one Bellabeat product.
For this analysis I wanted a tool or set of tools with the following features:
With these requirements in mind, I considered three tools: R, spreadsheets, and databases:
| Feature | R | Spreadsheets | Databases |
|---|---|---|---|
| Power for large data sets | Yes | No | Yes |
| Data manipulation tools | Yes | Yes | Yes |
| Data analysis tools | Yes | Yes | No |
| Data visualisation tools | Yes | Yes | No |
| Separate analysis files | Yes | No | Yes |
| Streamlined report generation | Yes | No | No |
| Straightforward to learn | No | Yes | No |
Clearly, R is the best single tool for this analysis. R lacks the straightforward operation of spreadsheet tools, and I will need to learn libraries and programming techniques as I do my analysis, but this is acceptable given my prior experience with other programming languages like Python.
Note: Python itself was not considered for this analysis: while it shares most of the features, advantages, and disadvantages of R, I’m already familiar with Python and wanted to use this case study to familiarise myself with R instead.
The first preparation stage involves setting up my RStudio environment for my analysis.
The R chunk below automatically loads all packages included in the “rqd_pkgs” list, installing them first if required: this ensures all packages can be loaded by other analysts replicating my work, and minimises the effort required to modify the package list.
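The pattern is a minimal install-if-missing loop; a sketch along these lines (the package names shown here are placeholders, not the full `rqd_pkgs` list used in the analysis):

```r
# Sketch of the install-if-missing loader described above; the package list
# shown here is illustrative, not the actual analysis list.
rqd_pkgs <- c("tools", "stats")  # placeholder contents

load_required_packages <- function(pkgs) {
  for (pkg in pkgs) {
    # Install only if the package is not already available
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # Attach by name (character.only lets library() take a string)
    library(pkg, character.only = TRUE)
  }
}

load_required_packages(rqd_pkgs)
```

Because the loop checks availability before installing, re-running the chunk on a machine that already has the packages costs almost nothing.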
For this data analysis, I’ll be making use of one public data set specified by Sršen, plus additional data sets as required to address any limitations found in that dataset.
The data set specified by Sršen is the FitBit Fitness Tracker Data Set. This is a public data set available under the CC0 License via Kaggle user Mobius.
For the initial analysis, the data set was downloaded in its entirety from Kaggle and stored locally on my computer. This provided a baseline for analysis, with any modifications to file names, folder re-structuring, or removal of unnecessary data tables to be conducted after I’m familiar with the raw data.
To get a top-level view of the data, I first use the readr package to load all of the required data sets directly into my R environment.
csv_dir <- "Fitabase_Data_Cleaned"
paths_dfs <- list.files(csv_dir, pattern = "\\.csv$", full.names = TRUE)
df_names <- paths_dfs %>%
  basename() %>%
  tools::file_path_sans_ext()
for (i in seq_along(df_names)) {
  assign(df_names[i], read_csv(paths_dfs[i]))
  cat(df_names[i], "\n")
}
rm(i, paths_dfs)
The data set contains data on the following features of the FitBit devices:
| Feature | Units | Sampling Rate |
|---|---|---|
| BMI | BMI | Manual/automatic logging |
| Body Weight | kg | Manual/automatic logging |
| Body Fat | % | Manual/automatic logging |
| Calorie burn | Calories | 1 minute |
| Distance | Unknown | 1 minute |
| Heart Rate | Beats per Minute (BPM) | 5 seconds |
| Intensity (of exercise) | Factor, 0 to 3 | 1 minute |
| METs (during exercise) | METs | 1 minute |
| Sleep | Factor, 1 to 3 | 1 minute |
| Steps | Steps | 1 minute |
Some details of the data gathering are not yet clear to me:
The file names in the data set present the following issues:
To correct each of these issues, I’ll rename the files using this naming convention:
[feature]_[src/sum]_[interval]_[shape].[filetype]
For example, “sleepDay_merged.csv” will be renamed to “sleep_sum_days_wide.csv”.
All variables in the data set are named in CapitalisedCase, whereas I would typically use snake_case by convention. This is a relatively minor issue that I could go without correcting, however a small number of variable names, like logId, are also inconsistently capitalised across different tables. Since I’ll be adjusting some names anyway, and there’s presumably a library function to change this with minimal effort, I’ll put it on the data-cleaning to-do list.
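There is indeed such a function: clean_names() from the janitor package converts all variable names to snake_case in one call. The core transformation can be sketched in base R (to_snake_case below is my own illustrative helper, not the library code):

```r
# Illustrative base-R version of the CapitalisedCase -> snake_case conversion;
# in practice janitor::clean_names() handles this (plus many more edge cases).
to_snake_case <- function(x) {
  # Insert an underscore between a lower-case letter/digit and an upper-case
  # letter, then lower-case everything
  tolower(gsub("([a-z0-9])([A-Z])", "\\1_\\2", x))
}

to_snake_case(c("TotalMinutesAsleep", "logId"))
# "total_minutes_asleep" "log_id"
```

Note that this also normalises the inconsistently capitalised names like logId, which is the main reason for bothering with the conversion at all.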
Some of the tables in the data set use variables of a type unsuitable for analysis. These variables and their required modifications are tabulated below:
| Variable | Original Type | Updated Type | Reason |
|---|---|---|---|
| ActivityDay | chr | datetime | Cannot perform datetime operations on chr variables |
| ActivityHour | chr | datetime | Cannot perform datetime operations on chr variables |
| ActivityMinute | chr | datetime | Cannot perform datetime operations on chr variables |
| Date | chr | datetime | Cannot perform datetime operations on chr variables |
| Id | num | chr | Disable scientific notation and numerical operations (IDs are not a numeric value) |
| LogId | num | chr | Disable scientific notation and numerical operations (IDs are not a numeric value) |
| SleepDay | chr | datetime | Cannot perform datetime operations on chr variables |
| Time | chr | datetime | Cannot perform datetime operations on chr variables |
All of the numeric variables in the data set have clearly defined units except for “Distance”. Distance does not appear to have a source data table: it’s only included in the “activity_sum_days_wide” and “intensity_sum_days_wide” tables, and only as summary data grouped by Intensity level. Floating-point values between 0 and 1 are present, so it seems reasonable to assume this is either kilometres or miles, as opposed to metres or feet. No geographical information is given in the data set, so I can’t assume the participants are from the U.S., where miles would be appropriate. Given miles and kilometres represent the same information on slightly different scales, any insights about different use cases between users should still be apparent, therefore I think it’s reasonable to assume the distances are given in kilometres for this analysis.
Two variables in the data set, exercise “Intensity” and sleep “Value”, are numerical factors with no defined range. For these factors, I can’t tell from the data alone whether all possible values that a FitBit can record are present. The values for exercise intensity, for example, range from 0 to 3; it could be the case that the FitBits used only generate four levels of intensity, but it could just as easily be the case that the values go up to 100 (i.e. a percentage). This has clear implications for our analysis: if 3 is the max, records of 3 indicate users wear their FitBits while exercising as hard as they can, whereas if 100 is the max, records of 3 indicate users wear their FitBits while sitting on the couch as hard as they can.
With this in mind, I’m going to invest a bit of time to confirm the meaning of these variables before attempting to analyse them (read: about ten hours to investigate everything and learn how to do it all in R and learn how to make it look suitably pretty and coherent in RMarkdown).
I’ll start by determining the range of values present in the Intensity data:
cat("Min Intensity:", min(intensity_src_mins_tall$intensity),
"\nMax Intensity:", max(intensity_src_mins_tall$intensity), "\n", sep = "")
The “activity_sum_days_wide” table provides some context clues as to what the levels might mean: the table summarises Intensity data into four new variables which, based on their names, appear to be associated with intensity levels like so:
- 0: Sedentary
- 1: Lightly Active
- 2: Fairly Active
- 3: Very Active
Let’s see if I can confirm this by recreating the data with that naming convention:
# Generate my version of the daily intensity summary for comparison with the original
intensity_daily_sum_wide <- intensity_src_mins_tall %>%
  mutate(activity_date_floored = floor_date(mdy_hms(ActivityMinute), unit = "days")) %>%
  group_by(Id, activity_date_floored) %>%
  summarize(
    minutes_sedentary = sum(case_when(Intensity == 0 ~ 1, TRUE ~ 0)),
    minutes_lightly_active = sum(case_when(Intensity == 1 ~ 1, TRUE ~ 0)),
    minutes_fairly_active = sum(case_when(Intensity == 2 ~ 1, TRUE ~ 0)),
    minutes_very_active = sum(case_when(Intensity == 3 ~ 1, TRUE ~ 0))
  ) %>%
  mutate(Id_ActivityDate_UID = paste(Id, activity_date_floored, sep = "_")) %>%
  arrange(Id, activity_date_floored, Id_ActivityDate_UID)
# Compare both versions of the data and return any dates with different values
intensity_daily_comp <- activity_sum_days_wide %>%
  mutate(ActivityDate_floored = floor_date(mdy(ActivityDate), unit = "days")) %>%
  mutate(Id_ActivityDate_UID = paste(Id, ActivityDate_floored, sep = "_")) %>%
  merge(intensity_daily_sum_wide, by = "Id_ActivityDate_UID", all = TRUE) %>%
  mutate(diff_minutes_sedentary = minutes_sedentary - SedentaryMinutes) %>%
  mutate(diff_minutes_lightly_active = minutes_lightly_active - LightlyActiveMinutes) %>%
  mutate(diff_minutes_fairly_active = minutes_fairly_active - FairlyActiveMinutes) %>%
  mutate(diff_minutes_very_active = minutes_very_active - VeryActiveMinutes) %>%
  select(
    Id_ActivityDate_UID,
    diff_minutes_sedentary,
    diff_minutes_lightly_active,
    diff_minutes_fairly_active,
    diff_minutes_very_active
  ) %>%
  filter(!(diff_minutes_sedentary == 0 &
           diff_minutes_lightly_active == 0 &
           diff_minutes_fairly_active == 0 &
           diff_minutes_very_active == 0)) %>%
  arrange(Id_ActivityDate_UID)
# For this table, glimpse() shows enough to demonstrate the validity of the method
glimpse(intensity_daily_comp)
rm(intensity_daily_sum_wide, intensity_daily_comp)
The approach above appears to work perfectly, with the exception of the “sedentary minutes” calculation, which is consistently higher in my version.
I think it’s safe to conclude that I got the mapping correct, given that: a) my version is consistently higher by at least 6 hours, which I suspect is caused by my method counting time asleep as “sedentary minutes” - which is not technically wrong, you know - and b) reversing the order of the mapping produces entirely wrong results.
That being said, I still don’t know whether these are the categories FitBits actually work in, or just a naming convention the data authors came up with, and I don’t know if all FitBit models work this way, so let’s do something I should have done from the start: read the manuals.
Activity trackers like FitBits detect activity intensity partly by measuring the user’s heart rate while exercising: a higher heart rate corresponds with a higher degree of exertion. As of April 2016, the three latest FitBit models with heart-rate tracking were:
A quick look through the product manuals for each model confirms they all break down user activity into four default heart-rate zones:
| Product | HR Zone 1 | HR Zone 2 | HR Zone 3 | HR Zone 4 |
|---|---|---|---|---|
| FitBit Blaze | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
| FitBit Charge HR | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
| FitBit Surge | “Out of Zone” | “Fat burn” | “Cardio” | “Peak” |
While these aren’t exactly the same terms as used in the data set, they’re clearly related - “Out of Zone” equates to “Sedentary”, for example.
All three FitBit manuals also make the same claim that the default zones are “based on American Heart Association recommendations”. Even without validating that claim, it indicates to me that the reasoning behind each zone is not arbitrary, and is consistent across devices, so I think I can assume any other FitBit models circa 2016 would follow the same classification scheme.
At this point, I’m satisfied that the below are the only four intensity levels I need to consider when analysing the data set, regardless of what models of FitBits were being used:
- 0: Sedentary
- 1: Lightly Active
- 2: Fairly Active
- 3: Very Active
As with exercise intensity, I start by determining the range of values present in the data:
cat("Min Sleep:", min(sleep_src_mins_tall$value),
"\nMax Sleep:", max(sleep_src_mins_tall$value), "\n", sep = "")
The values for sleep quality range from 1 to 3. The summary data tables for sleep quality introduce only two new variable names: “Total Time In Bed”, and “Total Minutes Asleep”. There’s not a valid name for each level of factor like there was for exercise, so in this case we go straight back to the manuals:
Further poking around the FitBit help pages on how to track sleep stats and what they all mean reveals that different devices track slightly different data if they have heart rate tracking:
The help pages also single out the Charge HR and the Surge as the only HR-tracking FitBits to not have full sleep stage tracking, leaving the Blaze as the only device from this time period with that feature. Blaze aside, motion-based sleep quality tracking appears to go all the way back to the FitBit One. Given this information, it seems fair to assume the following mapping for the sleep data:
- 1: Awake
- 2: Restless
- 3: Asleep
As with the intensity data, I can confirm this mapping by attempting to recreate the summary data in “sleepDay_merged”:
# Generate my version of sleepDay_merged for comparison with the original
sleep_src_mins_tall_NW <- sleep_src_mins_tall %>%
  mutate(date_floored = floor_date(mdy_hms(date), unit = "days")) %>%
  # Sum time asleep for each Log ID
  group_by(logId) %>%
  summarize(
    Id = min(Id),
    # Associate each Log ID with the latest date recorded under it
    SleepDay = max(date_floored),
    minutes_in_bed = n(),
    minutes_awake = sum(case_when(value == 3 ~ 1, TRUE ~ 0)),
    minutes_restless = sum(case_when(value == 2 ~ 1, TRUE ~ 0)),
    minutes_asleep = sum(case_when(value == 1 ~ 1, TRUE ~ 0))
  ) %>%
  # Sum time asleep for each date based on SleepDay
  group_by(Id, SleepDay) %>%
  summarize(
    TotalSleepRecords_2 = n(),
    TotalMinutesAsleep_2 = sum(minutes_asleep),
    TotalTimeInBed_2 = sum(minutes_in_bed),
    TotalMinutesAwake = sum(minutes_awake),
    TotalMinutesRestless = sum(minutes_restless)
  ) %>%
  mutate(Id_SleepDay_UID = paste(Id, SleepDay, sep = "_")) %>%
  arrange(Id_SleepDay_UID)
# Compare both versions of the data and return any dates with different values
sleepDay_comp <- sleep_sum_days_wide %>%
  mutate(SleepDay_floored = floor_date(mdy_hms(SleepDay), unit = "days")) %>%
  mutate(Id_SleepDay_UID = paste(Id, SleepDay_floored, sep = "_")) %>%
  arrange(Id_SleepDay_UID) %>%
  merge(sleep_src_mins_tall_NW, by = "Id_SleepDay_UID", all = TRUE) %>%
  mutate(recordDiff = TotalSleepRecords_2 - TotalSleepRecords) %>%
  mutate(sleepDiff = TotalMinutesAsleep_2 - TotalMinutesAsleep) %>%
  mutate(bedDiff = TotalTimeInBed_2 - TotalTimeInBed) %>%
  select(Id_SleepDay_UID, recordDiff, sleepDiff, bedDiff) %>%
  filter(!(recordDiff == 0 & sleepDiff == 0 & bedDiff == 0)) %>%
  arrange(Id_SleepDay_UID)
# For this table, glimpse() shows enough to demonstrate the validity of the method
glimpse(sleepDay_comp)
rm(sleep_src_mins_tall_NW, sleepDay_comp)
I was able to recreate the existing sleep_sum_days_wide table almost perfectly by summing sleep times per Log ID, with the latest date associated with each Log ID being used as the “date” for that sleep. In practice it turns out I had the mapping inverted, and the following is actually used:
- 1: Asleep
- 2: Restless
- 3: Awake
So I guess read that as “1 is highest-quality sleep, 3 is worst-quality”.
My version of the data contains a few rows that vary slightly from the original, by 1 to 22 minutes. I haven’t been able to determine the source of this error, but they’re more than close enough to confirm I don’t have the factor level mapping backwards, so I’m ready to proceed.
The data set includes some tables that contain source data for a given feature, e.g. heart-rate tracking, and others that contain summary data, e.g. everything in the “activity_sum_days_wide” table.
Some of these are useful, for instance:
Others are not useful, for instance:
Those tables that do not provide useful summaries can be excluded from the data analysis: if a specific need is found for their data, they can be reloaded or recreated manually as required.
The “bodycomp_src_logs_wide.csv” file contains both kilogram and pound variables: these describe the same information, and all of the observations contain values for both variables, so one variable can be dropped with no loss of data. The choice between the two formats seems arbitrary for my analysis, so I’m keeping the kilogram data, as it’s expressed in an SI unit.
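The drop itself is a one-liner; a base-R sketch on a toy stand-in table (the dplyr equivalent would be `select(-weight_pounds)` on the real data frame):

```r
# Toy stand-in for the body-composition table, to show the drop; in the
# analysis this is applied to the real data frame after name cleaning.
bodycomp <- data.frame(
  weight_kg     = c(52.6, 70.3),
  weight_pounds = c(115.96, 154.99)
)
bodycomp$weight_pounds <- NULL  # drop the redundant column
names(bodycomp)  # "weight_kg"
```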
“intensity_sum_hours_wide.csv” contains Total Intensity and Average Intensity. Total Intensity values exceed 3, the maximum intensity level, because they are sums of the per-minute intensity values within each hour, so they are not directly useful. Average Intensity is just the Total Intensity for each hour divided by 60 minutes per hour.
For this analysis, I’ll rename the files to use this naming convention:
[feature]_[src/sum]_[interval]_[shape].[filetype]
Applying the naming convention to the data set yields the following file names:
| Original | Updated |
|---|---|
| dailyActivity_merged.csv | activity_sum_days_wide.csv |
| dailyCalories_merged.csv | calories_sum_days_tall.csv |
| dailyIntensities_merged.csv | intensity_sum_days_wide.csv |
| dailySteps_merged.csv | steps_sum_days_tall.csv |
| heartrate_seconds_merged.csv | heartrate_src_seconds_tall.csv |
| hourlyCalories_merged.csv | calories_sum_hours_tall.csv |
| hourlyIntensities_merged.csv | intensity_sum_hours_wide.csv |
| hourlySteps_merged.csv | steps_sum_hours_tall.csv |
| minuteCaloriesNarrow_merged.csv | calories_src_mins_tall.csv |
| minuteCaloriesWide_merged.csv | calories_sum_mins_wide.csv |
| minuteIntensitiesNarrow_merged.csv | intensity_src_mins_tall.csv |
| minuteIntensitiesWide_merged.csv | intensity_sum_mins_wide.csv |
| minuteMETsNarrow_merged.csv | mets_src_mins_tall.csv |
| minuteSleep_merged.csv | sleep_src_mins_tall.csv |
| minuteStepsNarrow_merged.csv | steps_src_mins_tall.csv |
| minuteStepsWide_merged.csv | steps_sum_mins_wide.csv |
| sleepDay_merged.csv | sleep_sum_days_wide.csv |
| weightLogInfo_merged.csv | bodycomp_src_logs_wide.csv |
The conversion was performed manually on my local device.
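For reproducibility, the same mapping could instead be scripted; a hedged sketch using base R’s file.rename() (the rename_map below shows only a subset of the table above, and rename_files is a hypothetical helper of my own):

```r
# Subset of the rename table, as a named vector: names(rename_map) are the
# original file names, the values are the updated names.
rename_map <- c(
  "dailyActivity_merged.csv" = "activity_sum_days_wide.csv",
  "sleepDay_merged.csv"      = "sleep_sum_days_wide.csv"
)

rename_files <- function(dir, rename_map) {
  # file.rename() is vectorised and returns TRUE per successful rename
  file.rename(
    from = file.path(dir, names(rename_map)),
    to   = file.path(dir, rename_map)
  )
}
```

Running `rename_files(csv_dir, rename_map)` against a copy of the raw download would apply the whole convention in one pass.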
Each data frame was checked for NULL values and non-null empty strings.
The only table with missing data was “bodycomp_src_logs_wide”, which was missing all but two of the data points for Body Fat Percentage. Given this lack of data, Body Fat Percentage will be excluded from the data analysis.
# Check for NULLs
cat("Checking for NULL/empty values...\n", sep = "")
for (df_name in df_names) {
  df <- get(df_name)
  num_nulls <- sum(is.na(df))
  if (num_nulls) {
    cat(df_name, ": ", num_nulls, " NULLs", "\n", sep = "")
  }
  # Check for empty strings (which do not show up as NULLs)
  empty_strings <- df %>%
    filter(if_any(where(is.character), ~ nchar(.) == 0))
  num_empty_strings <- nrow(empty_strings)
  if (num_empty_strings) {
    cat(df_name, ": ", num_empty_strings, " empty strings", "\n", sep = "")
  }
}
cat("Checking for NULL/empty values done.\n", sep = "")
rm(df, df_name, num_nulls, empty_strings, num_empty_strings)
Functions and environment variables used in the code below have been omitted for brevity and can be found in Appendix A.
## Rename Variables ----
cat("Cleaning variable names...\n", sep = "")
for (df_name in df_names) {
  cat("Cleaning ", df_name, "...\n", sep = "")
  assign(df_name, get(df_name) %>% clean_names())
}
cat("Cleaning variable names complete.\n", sep = "")
cat("Renaming variables...\n", sep = "")
var_mods_rename <- var_mods %>%
  filter(var_old != "" & var_new != "")
for (df_name in df_names) {
  assign(df_name, rename_df_variables(df_name, var_mods_rename))
}
rm(df_name)
cat("Renaming variables complete.\n", sep = "")
All dataframes were checked for duplicate rows using the anyDuplicated() function, and duplicate rows were removed using the distinct() function. The function was tested using a prototype version that generated dataframes of all of the apparent duplicate rows: these were verified manually before the function was allowed to modify the actual dataframes. The logic was verified again by re-running it after the initial removal: all dataframes reported zero duplicates on the second run, confirming the success of the first pass.
cat("Checking for duplicate values...\n", sep = "")
for (df_name in df_names) {
  cat("Checking ", df_name, " for duplicates... ", sep = "")
  df <- get(df_name)
  if (!anyDuplicated(df)) {
    cat("0 duplicates removed. Done.\n", sep = "")
  } else {
    nrow_before <- nrow(df)
    df <- distinct(df)
    nrow_after <- nrow(df)
    cat("Removing ", (nrow_before - nrow_after), " duplicates... ", sep = "")
    assign(df_name, df)
    cat("Done.\n", sep = "")
  }
}
rm(df, df_name, nrow_before, nrow_after)
cat("Checking for duplicate values complete.\n", sep = "")
| Variable | Original Type | Updated Type | Reason |
|---|---|---|---|
| activity_day | chr | datetime | Cannot perform datetime operations on chr variables |
| activity_hour | chr | datetime | Cannot perform datetime operations on chr variables |
| activity_minute | chr | datetime | Cannot perform datetime operations on chr variables |
| date | chr | datetime | Cannot perform datetime operations on chr variables |
| id | num | chr | Disable numerical operations (IDs are considered a UID string) |
| log_id | num | chr | Disable numerical operations (IDs are considered a UID string) |
| time | chr | datetime | Cannot perform datetime operations on chr variables |
| sleep_rank | num | factor 1:3 | Disable numerical operations (value is a ranking, not an amount) |
| intensity | num | factor 0:3 | Disable numerical operations (value is a ranking, not an amount) |
# Global Variable Declarations ----
var_mods_recast <- var_mods %>%
filter(type_new != "")
# Function Declarations ----
are_identical_lists <- function(list1, list2) {
  if (length(list1) != length(list2)) {
    return(FALSE)
  }
  for (i in seq_along(list1)) {
    if (!identical(list1[[i]], list2[[i]])) {
      cat("Non-identical lists at list1[", i, "]. Exiting.\n", sep = "")
      print(list1[[i]])
      print(list2[[i]])
      return(FALSE)
    }
  }
  return(TRUE)
}
get_df_var_types <- function(df_name) {
  cat("Getting current variable types for ", df_name, "... ", sep = "")
  df <- get(df_name)
  var_types <- data.frame(
    var = names(df),
    type = sapply(df, function(col) class(col)[1])
  )
  cat("Done.\n", sep = "")
  return(var_types)
}
get_df_target_var_types <- function(df_name) {
  cat("Getting target variable types for ", df_name, "...\n", sep = "")
  var_types <- get_df_var_types(df_name)
  # Iterate over rows in the current var_types
  for (var_row in seq_len(nrow(var_types))) {
    # Check whether the variable name appears in var_mods_recast
    var_name <- var_types$var[var_row]
    for (mods_row in seq_len(nrow(var_mods_recast))) {
      var_name_mods <- var_mods_recast$var_new[mods_row]
      # If not, skip; if so, replace the type with the target type
      if (var_name == var_name_mods) {
        type_new <- var_mods_recast$type_new[mods_row]
        cat("Updated \"", var_name, "\" from ", var_types$type[var_row],
            " to ", type_new, ".\n", sep = "")
        var_types$type[var_row] <- type_new
      }
    }
  }
  cat("Getting target variable types for ", df_name, " done.\n", sep = "")
  return(var_types)
}
recast_variables <- function(df_name) {
  df <- get(df_name)
  for (i in seq_len(nrow(var_mods_recast))) {
    var_new <- var_mods_recast$var_new[i]
    type_new <- var_mods_recast$type_new[i]
    if (var_new %in% colnames(df)) {
      cat("Converting ", df_name, "$", var_new, " to ", type_new, "... ", sep = "")
      if (type_new == "character") {
        df <- df %>% mutate("{var_new}" := as.character(!!sym(var_new)))
      } else if (type_new == "Date") {
        df <- df %>% mutate("{var_new}" := mdy(!!sym(var_new)))
      } else if (type_new == "POSIXct") {
        df <- df %>% mutate("{var_new}" := mdy_hms(!!sym(var_new)))
      } else {
        cat("type_new not found: not converting... ", sep = "")
      }
      cat("Done.\n", sep = "")
    }
  }
  return(df)
}
# Recast Variables ----
cat("Generating list of target column types for testing...\n", sep = "")
df_types_target <- lapply(df_names, get_df_target_var_types)
names(df_types_target) <- df_names
cat("Generating list of target column types complete.\n", sep = "")
cat("Recasting variables...\n", sep = "")
for (df_name in df_names) {
  cat("Recasting ", df_name, "...\n", sep = "")
  assign(df_name, recast_variables(df_name))
  cat("Recasting ", df_name, " complete.\n", sep = "")
}
cat("Recasting variables complete.\n", sep = "")
# Test Recasting of Variables ----
cat("Generating list of updated column types...\n", sep = "")
df_types_after <- lapply(df_names, get_df_var_types)
names(df_types_after) <- df_names
cat("Generating list of updated column types complete.\n", sep = "")
cat("Checking updated column types against target types...\n", sep = "")
test_succeeded <- are_identical_lists(df_types_after, df_types_target)
cat("Checking updated column types against target types complete.\n", sep = "")
cat("Data recasting ", case_when(test_succeeded ~ "succeeded", TRUE ~ "failed"), ".\n", sep = "")
rm(df_name, df_types_target, df_types_after, test_succeeded)
# Individual recast of variables with only one occurrence
sleep_src_mins_tall <- sleep_src_mins_tall %>%
  mutate(sleep_rank = factor(sleep_rank, levels = 1:3,
                             labels = c("Asleep", "Restless", "Awake")))
intensity_src_mins_tall <- intensity_src_mins_tall %>%
  mutate(intensity = factor(intensity, levels = 0:3,
                            labels = c("Sedentary", "Lightly Active", "Fairly Active", "Very Active")))
Given the large number of variables in the data set, the recasting procedure includes test code to confirm the updated variable types match the types specified in the variable mods list. The test code works by first generating a list of the desired final variable/type pairs, then, once the conversion is completed, generating a second list of the actual variable/type pairs in the data frames to compare it to. This has the advantage of not just confirming the desired conversions took place, but also checks for any unintended changes to variables that did not require conversion.
Building the test code added a decent amount of work to the project, but now that it’s working, it can be reused and scaled for future projects.
cat("Trimming whitespace in variable names...\n", sep = "")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    col_name_trimmed <- str_trim(col_name)
    if (col_name != col_name_trimmed) {
      cat("Trimming ", df_name, "[", col_name, "] to ", col_name_trimmed, "... ", sep = "")
      # rename() takes new_name = old_name
      df <- df %>% rename(!!col_name_trimmed := !!col_name)
      cat("Done.\n", sep = "")
    }
  }
  # Write the modified data frame back to the environment
  assign(df_name, df)
}
rm(df, df_name, col_name, col_name_trimmed)
cat("Trimming whitespace in variable names complete.\n", sep = "")
cat("Trimming whitespace in character-type data values...\n", sep = "")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in colnames(df)) {
    if (is.character(df[[col_name]])) {
      cat("Trimming ", df_name, "[", col_name, "]... ", sep = "")
      df <- df %>% mutate(across(all_of(col_name), str_trim))
      cat("Done.\n", sep = "")
    }
  }
  # Write the modified data frame back to the environment
  assign(df_name, df)
}
rm(df, df_name, col_name)
cat("Trimming whitespace in character-type data values complete.\n", sep = "")
Given that the data was not entered manually by the user, there’s no way for me to manually check the correctness of all of the numeric values included in the data set. The values were instead checked against pre-determined limits to verify that they fall within realistic ranges, as detailed below.
Validating the data in this way also helps confirm the data makes sense in terms of the business logic, by confirming the data falls within realistic ranges given the capabilities of the devices and the types of data they claim to track.
I wrote a short function to validate the length of all values in a data frame for a given column name and valid length. This was then used to check ID values in the data set against their correct length, including:
- “id” values (10 characters)
- “sleep_log_id” values (11 characters)
- “bodycomp_log_id” values (13 characters)
The validation confirmed all values were the correct length.
validate_string_length <- function(df_name, col_name, valid_length) {
  cat("Checking ", df_name, " for invalid ", col_name, " values... ", sep = "")
  df <- get(df_name)
  if (!(col_name %in% colnames(df))) {
    cat("column not found.\n", sep = "")
    return(invisible(NULL))
  }
  invalid_values <- df %>%
    select(!!sym(col_name)) %>%
    filter(nchar(!!sym(col_name)) != valid_length)
  if (nrow(invalid_values) > 0) {
    cat(nrow(invalid_values), " invalid values found:\n", sep = "")
    glimpse(invalid_values)
  } else {
    cat("complete.\n", sep = "")
  }
}
result <- map(df_names, ~validate_string_length(.x, col_name = "id", valid_length = 10))
result <- map(df_names, ~validate_string_length(.x, col_name = "sleep_log_id", valid_length = 11))
result <- map(df_names, ~validate_string_length(.x, col_name = "bodycomp_log_id", valid_length = 13))
rm(result)
To further validate the large number of ID data points in the data set, I wrote a short function to check all ID variables in each data frame and determine the minimum and maximum values. The intent was to identify any possible erroneous values within the acceptable length limits, e.g. all zeroes or all nines. The function relies on min() and max() ordering character values, which matches numeric ordering for the fixed-width IDs in this data set.
The validation confirmed all values were within a realistic range.
cat("Checking min/max ID value ranges...\n", sep = "")
id_col_names <- c("id", "sleep_log_id", "bodycomp_log_id")
for (df_name in df_names) {
  df <- get(df_name)
  for (col_name in id_col_names) {
    if (col_name %in% colnames(df)) {
      cat(df_name, "[", col_name, "] Min = ", min(df[[col_name]]),
          " Max = ", max(df[[col_name]]), ".\n", sep = "")
    }
  }
}
rm(df, df_name, col_name, id_col_names)
cat("Checking min/max ID value ranges complete.\n", sep = "")
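One caveat worth flagging: min() and max() on character vectors compare lexicographically, not numerically, and the two orderings only agree here because all IDs of a given type share a fixed width. A quick base-R illustration:

```r
# Fixed-width ID strings: lexicographic and numeric order agree
ids_fixed <- c("1000000001", "9000000009", "5000000005")
min(ids_fixed)  # "1000000001"
max(ids_fixed)  # "9000000009"

# With mixed widths the two orderings diverge: "10" sorts before "9"
ids_mixed <- c("9", "10")
max(ids_mixed)  # "9", even though 10 > 9 numerically
```

Since the length validation above already guarantees fixed widths, the min/max check is safe for this data set.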
The data set is described as containing data from users collected between “03.12.2016-05.12.2016”: all records in the data set were validated against this date range.
validate_dates <- function(df_name) {
  valid_dates_stt <- as.Date("2016-03-12", format = "%Y-%m-%d")
  valid_dates_end <- as.Date("2016-05-14", format = "%Y-%m-%d")
  cat("Validating dates for ", df_name, "... ", sep = "")
  date_columns <- get(df_name) %>%
    select_if(function(col) is.POSIXct(col) || is.Date(col))
  if (ncol(date_columns) == 0) {
    cat("Error: no date column found.\n", sep = "")
  } else if (ncol(date_columns) > 1) {
    cat("Error: more than one date column found for ", df_name, ".\n", sep = "")
  } else {
    outside_range <- date_columns %>%
      filter(if_any(everything(), ~ . < valid_dates_stt | . > valid_dates_end))
    if (nrow(outside_range) > 0) {
      cat("found ", nrow(outside_range), " invalid values:\n", sep = "")
      print(outside_range)
    } else {
      cat("complete.\n")
    }
  }
}
result <- lapply(df_names, validate_dates)
rm(result)
This check found timestamps just outside the range, up to 8am on 05.13.2016, the day after the data set supposedly ended. I didn’t consider this to be a problem, so the valid end date was extended to the 14th of May 2016, after which all dates passed this check.
Checklist:
- All numeric values are non-negative
- Percentage values are less than 100
- Weight values are positive and make sense (e.g. less than 200kg)
- BMI values are in range
- Daily, hourly, and minute duration sums are no more than one day, hour, or minute, respectively
- Distances make sense
- Step counts make sense (check for >20000 for a start)
- Calories are within normal range
- Heart rates are less than 200
First, a check for negative values was run on all numeric columns in the data set.
validate_numerics <- function(df_name) {
  # Performs all checks on numeric values that are required in more than one
  # data frame, e.g. non-negativity and summation
  cat("Validating numerics for ", df_name, "... ", sep = "")
  numerics <- get(df_name) %>% select_if(is.numeric)
  if (ncol(numerics) == 0) {
    cat("No numeric variables found.\n", sep = "")
  } else {
    # Check for negative values
    negative_values <- numerics %>%
      filter(if_any(everything(), ~ . < 0))
    if (nrow(negative_values) > 0) {
      cat("found ", nrow(negative_values), " invalid values:\n", sep = "")
      print(negative_values)
    } else {
      cat("complete.\n")
    }
    # Check for summation
  }
}
result <- lapply(df_names, validate_numerics)
rm(result)
Results:
Second, variable-specific checks were run to confirm the data fell within realistic ranges given my understanding of the variables being measured.
Note: The maximum values given are used as thresholds above which values may not be realistic, not as hard limits for acceptability: a heart-rate of 200bpm, for example, is entirely possible, but it is high enough that I would want to check if the data point corresponded to a period of high-intensity exercise.
# For a given list of dfs, column names, and a min-max range, check all
# matching columns in all matching dfs against that range
validate_within_range <- function(df_name, column_names, range_min, range_max) {
  df <- get(df_name)
  for (col_name in column_names) {
    cat("Checking ranges for ", df_name, "[", col_name, "]... ", sep = "")
    if (!(col_name %in% colnames(df))) {
      cat("column not found.\n")
    } else {
      out_of_range <- df %>%
        filter(!!sym(col_name) < range_min | !!sym(col_name) > range_max)
      if (nrow(out_of_range) <= 0) {
        cat("complete.\n", sep = "")
      } else {
        cat("Found ", nrow(out_of_range), " out-of-range values:\n", sep = "")
        glimpse(out_of_range)
      }
      rm(out_of_range)
    }
  }
  rm(df, col_name)
}
# Minute summation
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide",
  "sleep_sum_days_wide")
valid_col_names <- c(
  "minutes_very_active",
  "minutes_fairly_active",
  "minutes_lightly_active",
  "minutes_sedentary",
  "minutes_asleep_total",
  "minutes_in_bed_total")
range_max <- 60 * 12 # Minutes in half a day
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Weights (kg)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_kg")
range_max <- 150 # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Weights (pounds, same limit as for weight in kilos)
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("weight_pounds")
kg2lb <- 2.204623
range_max <- 150 * kg2lb # Arbitrarily chosen as high enough to be a potentially erroneous value
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(kg2lb)
# BMI
valid_df_names <- c("bodycomp_src_logs_wide")
valid_col_names <- c("bmi")
range_max <- 40.0 # Corresponds with the WHO "Obese (Class III)" weight category
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Distances
valid_df_names <- c(
  "activity_sum_days_wide",
  "intensity_sum_days_wide")
valid_col_names <- c(
  "distance_lightly_active",
  "distance_logged_activities",
  "distance_moderately_active",
  "distance_sedentary",
  "distance_total",
  "distance_tracker",
  "distance_very_active")
range_max <- 21.08241 # Equivalent to one half-marathon
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Step counts
valid_df_names <- c(
  "activity_sum_days_wide",
  "steps_sum_days_tall",
  "steps_sum_hours_tall",
  "steps_src_mins_tall")
valid_col_names <- c(
  "steps",
  "steps_total")
range_max <- 14800 # Set to double the average daily step count for Australian adults
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Calories
valid_df_names <- c(
"activity_sum_days_wide",
"calories_src_mins_tall",
"calories_sum_days_tall",
"calories_sum_hours_tall")
valid_col_names <- c("calories")
range_max <- 4000 # Chosen arbitrarily as double the typically-recommended daily caloric intake
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# METs
valid_df_names <- c("mets_src_mins_tall")
valid_col_names <- c("mets")
range_max <- 12 # Equivalent to vigorous squash playing
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
# Heart Rates
valid_df_names <- c("heartrate_src_seconds_tall")
valid_col_names <- c("heart_rate")
range_max <- 200 # Chosen based on average 100% heart-rate for a 20 y.o. (Source: American Heart Foundation)
result <- map(valid_df_names, ~validate_within_range(.x, valid_col_names, 0, range_max))
rm(range_max, result, valid_df_names, valid_col_names)
Results:
With this in mind, I attempted to corroborate the METs data by plotting it against other related data, specifically calories-burned and heart-rate. Individual charts were plotted for each user ID.
# Generate minute-level summary heart-rate data for merging with calorie/intensity data
heartrate_sum_minutes_tall <- heartrate_src_seconds_tall %>%
  mutate(activity_minute = floor_date(heart_rate_second, unit = "minute")) %>%
  group_by(id, activity_minute) %>%
  summarize(heart_rate_mean = mean(heart_rate))
mets_vs_heart_rate <- mets_src_mins_tall %>%
  merge(., intensity_src_mins_tall, by = c("id", "activity_minute")) %>%
  merge(., heartrate_sum_minutes_tall, by = c("id", "activity_minute"))
mets_vs_calories <- mets_src_mins_tall %>%
  merge(., intensity_src_mins_tall, by = c("id", "activity_minute")) %>%
  merge(., calories_src_mins_tall, by = c("id", "activity_minute"))
plot_mets_vs_heartrate <- ggplot(data = mets_vs_heart_rate) +
  geom_point(mapping = aes(x = heart_rate_mean, y = mets, shape = intensity, color = intensity)) +
  geom_smooth(mapping = aes(x = heart_rate_mean, y = mets), method = "lm", se = FALSE, color = "blue") +
  stat_cor(aes(x = heart_rate_mean, y = mets, label = ..rr.label..), label.x = 0, label.y = 50) +
  facet_wrap(vars(id))
print(plot_mets_vs_heartrate)
plot_mets_vs_calories <- ggplot(data = mets_vs_calories) +
  geom_point(mapping = aes(x = calories, y = mets, shape = intensity, color = intensity)) +
  # geom_smooth(mapping = aes(x = calories, y = mets), method = "lm", se = FALSE, color = "blue") +
  stat_cor(aes(x = calories, y = mets, label = ..rr.label..), label.x = 0, label.y = 50) +
  facet_wrap(vars(id))
print(plot_mets_vs_calories)
plot_mets_vs_intensity <- ggplot(data = mets_vs_calories) +
  geom_point(mapping = aes(x = intensity, y = mets)) +
  # geom_smooth(mapping = aes(x = calories, y = mets), method = "lm", se = FALSE, color = "blue") +
  # stat_cor(aes(x = calories, y = mets, label = ..rr.label..), label.x = 0, label.y = 50) +
  facet_wrap(vars(id))
print(plot_mets_vs_intensity)
Results:
Given the very tight relationship between METs and calories, and the unclear sampling method used to generate the MET data, I decided to keep the METs data as-is and consider all values valid for the purposes of cleaning. After cleaning, further analysis of the MET data may shed more light on its meaning; in the meantime, the calories, heart-rate, and intensity variables can be used to analyse energy expenditure and activity type.
Sources for ranges:

- BMI: Based on WHO statistical categories for classifying BMI values: https://en.wikipedia.org/wiki/Body_mass_index#Categories
- Half-marathon: https://en.wikipedia.org/wiki/Half_marathon
- Step-count: https://www.abs.gov.au/statistics/health/health-conditions-and-risks/australian-health-survey-physical-activity/latest-release#pedometer-steps
- Caloric intake: Chosen arbitrarily as double my own daily required intake, based on a calculator accessed at: https://www.eatforhealth.gov.au/nutrition-calculators/daily-energy-requirements-calculator
- Max HR: https://www.heart.org/en/healthy-living/fitness/fitness-basics/target-heart-rates
- METs: https://www.hsph.harvard.edu/nutritionsource/staying-active/, https://onlinelibrary.wiley.com/doi/epdf/10.1002/clc.4960130809
Use the advertised features of the Ivy to guide the analysis
These are generic functions I developed for use during the analysis. They are declared here to be made available to subsequent code chunks.
round_data_to_bin <- function(data, bin_width) {
  rounding_factor <- 1 / bin_width
  return(round(data * rounding_factor) / rounding_factor)
}
get_histogram_max_count <- function(data, bin_width) {
  # Round data to nearest multiple of bin_width, then find the largest bin count
  data_rounded <- round_data_to_bin(data, bin_width)
  max_count <- max(table(data_rounded))
  return(max_count)
}
rescale_plot <- function(p, x_min = 0, x_max = 10, x_step = 1, y_min = 0, y_max = 10, y_step = 1) {
  p <- p +
    scale_x_continuous(breaks = seq(x_min, x_max, by = x_step),
                       labels = scales::comma_format(),
                       limits = c(x_min, x_max)) +
    scale_y_continuous(breaks = seq(y_min, y_max, by = y_step),
                       labels = scales::comma_format(),
                       limits = c(y_min, y_max))
  return(p)
}
plot_histo_pareto <- function(data, bin_width) {
  # Cumulative data uses same bins as histogram
  data <- sort(data)
  data_rounded <- round_data_to_bin(data, bin_width)
  # Calculate highest count so axes can be adjusted
  highest_count <- get_histogram_max_count(data, bin_width)
  data_pareto <- seq(1, length(data), by = 1) / length(data) * highest_count
  # Calculate stats for overlay on histogram
  data_max <- max(data)
  data_mean <- round(mean(data), digits = 2)
  data_median <- round(median(data), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(data_mean, data_median))
  label_text <- paste("Mean =", data_mean, ", Median =", data_median)
  p <- ggplot() +
    geom_histogram(aes(x = data),
                   color = "white",
                   binwidth = bin_width) +
    geom_line(aes(x = data_rounded,
                  y = data_pareto),
              color = "forestgreen") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(0, data_max + bin_width, by = bin_width)) +
    scale_y_continuous(name = "Users",
                       breaks = seq(0, highest_count, by = 1),
                       sec.axis = sec_axis(~ . / highest_count, name = "Cumulative Percentage of Users")) +
    # annotate("text", x = Inf, y = Inf, label = label_text,
    #          hjust = 1, vjust = 1, size = 4) +
    labs(x = "Value",
         caption = label_text)
  return(p)
}
get_time_coefficients <- function(ids, timestamps, values) {
  tbl <- tibble(
    id = ids,
    timestamp = timestamps,
    value = values
  )
  coeffs <- tbl %>%
    mutate(day_of_year = yday(timestamp)) %>%
    group_by(id) %>%
    summarize(correlation = cor(day_of_year, value))
  return(coeffs)
}
plot_time_coefficients <- function(coeffs) {
  bin_width <- 0.1
  max_count <- get_histogram_max_count(coeffs$correlation, bin_width)
  coeffs_mean <- round(mean(coeffs$correlation, na.rm = TRUE), digits = 2)
  coeffs_median <- round(median(coeffs$correlation, na.rm = TRUE), digits = 2)
  sum_stats <- data.frame(Statistics = c("Mean", "Median"),
                          value = c(coeffs_mean, coeffs_median))
  label_text <- paste("Mean =", coeffs_mean, "\nMedian =", coeffs_median)
  ggplot(coeffs, aes(x = correlation)) +
    geom_histogram(binwidth = bin_width, color = "white") +
    geom_vline(data = sum_stats,
               aes(xintercept = value,
                   linetype = Statistics,
                   color = Statistics),
               size = 1) +
    scale_x_continuous(breaks = seq(-1.0, 1.0, by = bin_width)) +
    scale_y_continuous(breaks = seq(0, max_count, by = 1)) +
    annotate("text", x = Inf, y = Inf, label = label_text,
             hjust = 1, vjust = 1, size = 4) +
    labs(title = "Correlations with Time",
         caption = "Positive values imply variable of interest generally increased over time.",
         x = "Coefficient of Correlation",
         y = "Count")
}
get_time_coefficients_plot <- function(ids, timestamps, values) {
  coeffs <- get_time_coefficients(ids, timestamps, values)
  p <- plot_time_coefficients(coeffs)
  return(p)
}
wide_to_stacked_bar_plot <- function(data_wide, key, value, key_order) {
  if (!("id" %in% colnames(data_wide))) {
    print("ERROR: data does not include \"id\" column: cannot convert.")
  } else {
    # Convert the data from wide to long. Set the factor levels to control
    # the stacking order of the bars
    data_long <- data_wide %>%
      tidyr::gather(key = !!sym(key), value = !!sym(value), -id) %>%
      mutate(!!sym(key) := factor(!!sym(key), levels = key_order))
    # Order IDs in wide data based on value of first key, then rearrange long data.
    # This ensures the resultant plot sorts the IDs in ascending order of the first key
    first_key <- key_order[1]
    data_wide <- data_wide %>% arrange(!!sym(first_key))
    data_long$id <- factor(
      data_long$id,
      levels = data_wide$id)
    # Plot graph with angled ID labels
    p <- ggplot(data_long, aes(x = id, y = !!sym(value), fill = !!sym(key))) +
      geom_bar(stat = "identity") +
      theme(axis.text.x = element_text(angle = 45, hjust = 1))
    # scale_y_continuous(breaks = seq(0, 24, by = 1))
    return(p)
  }
}
scatter_with_LOBF <- function(data_x, data_y, labels = NULL) {
  data <- tibble(
    x = data_x,
    y = data_y
  )
  p <- ggplot(data,
              aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm",
                se = FALSE,
                color = "blue") +
    stat_cor(mapping = aes(label = ..rr.label..),
             method = "pearson",
             label.x = -Inf,
             label.y = Inf,
             hjust = -0.1,
             vjust = 1.1) +
    geom_text(aes(label = sprintf("Gradient: %.2f", coef(lm(y ~ x, data = data))[2])),
              x = -Inf,
              y = Inf,
              hjust = -0.05,
              vjust = 3.5)
  if (!is.null(labels) && length(labels) > 1) {
    p <- p + labs(title = paste(labels[1], "vs.", labels[2]),
                  x = labels[1],
                  y = labels[2])
  }
  return(p)
}
Heart-rate Tracking: “Use it to track your workout progress and optimize personal training routines.”
Exercise Tracking: “Ivy will recognize your activity during the day, help you track up to 80 types of activity, count your steps, and discover how all that affects your body.”
Step Counts
Do people count their steps? Yes: all but three users logged step counts.
Do people engage in different types of activity? Yes: there is a wide spread of Fairly Active and Very Active intensity levels, so it is safe to say users don’t all work out the same way, and the more activity types the device can track, the better.
These three features are closely related, since one of the primary functions of the HR tracking is to detect exercise intensity; steps and distance tracking can also be used to further analyse users’ activity.
In order to better understand how people are using their FitBits, I analysed the amount and intensity of each user’s exercise.
I start by getting the top-level breakdown of users’ time by intensity:
# For each ID, generate daily average hours for each of the four intensity levels
mean_daily_intensities_wide <- intensity_sum_days_wide %>%
  group_by(id) %>%
  summarize("sedentary" = mean(minutes_sedentary) / 60,
            "lightly_active" = mean(minutes_lightly_active) / 60,
            "fairly_active" = mean(minutes_fairly_active) / 60,
            "very_active" = mean(minutes_very_active) / 60)
# Convert data to long-format for plotting as a histogram
intensity_order = c("sedentary", "lightly_active", "fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
  tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
  mutate(intensity = factor(intensity, levels = intensity_order))
# Reorder IDs in order of total active (Fairly + Very Active) time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate("total_active" = fairly_active + very_active) %>%
  arrange(total_active)
# Apply order of IDs to long-format data to force plot order
mean_daily_intensities_long$id <- factor(
  mean_daily_intensities_long$id,
  levels = mean_daily_intensities_wide$id)
viz_avg_time_by_intensity <- ggplot(mean_daily_intensities_long, aes(x = id, y = mean_hours, fill = intensity)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 24, by = 1)) +
  labs(title = "Average Time by Intensity Zone",
       x = "User ID",
       y = "Total Average Daily Time (hours)")
print(viz_avg_time_by_intensity)
Preliminary Findings:
Given that the large majority of non-Sedentary time is “Lightly Active”, and based on our understanding of the Intensity zones from previous sections, I will assume “Lightly Active” time includes any activity more intense than sitting down, up to and including activities like walking for leisure. Anything more intense would therefore fall into the “Fairly Active” or “Very Active” categories. With this in mind, I’ll proceed with the assumption that the “Fairly Active” and “Very Active” zones represent intentional exercise.
Next, I analyse users’ time spent intentionally exercising:
# Update total_active to reflect new definition
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate(total_active = fairly_active + very_active) %>%
  arrange(total_active)
mean_active_hours <- mean_daily_intensities_wide %>%
  select(id, total_active)
# Add cumulative user count for Pareto analysis
total_users <- nrow(mean_active_hours)
bin_width <- 0.25
rounding_factor <- 1 / bin_width
mean_active_hours <- mean_active_hours %>%
  mutate(total_active_rounded = round(total_active * rounding_factor) / rounding_factor) %>%
  mutate(cumulative_users = row_number() / total_users)
mean_hours <- mean(mean_active_hours$total_active)
median_hours <- median(mean_active_hours$total_active)
ymax <- 12
xmax <- 13.5
viz_active_hrs_per_week <- ggplot(mean_active_hours, aes(x = total_active)) +
  geom_histogram(binwidth = bin_width, col = "white") +
  scale_x_continuous(breaks = seq(0, xmax, by = bin_width)) +
  geom_line(aes(x = total_active_rounded,
                y = cumulative_users * ymax),
            col = "red") +
  geom_vline(aes(xintercept = mean_hours, col = 'red')) +
  geom_vline(aes(xintercept = median_hours), col = 'blue') +
  scale_y_continuous(name = "Number of Users",
                     breaks = seq(0, ymax, by = 1),
                     sec.axis = sec_axis(~ . / ymax, name = "Cumulative Percentage of Users")) +
  labs(title = "Active Hours per Week",
       x = "Hours")
# The manual plot above is superseded by the generic Pareto helper
viz_active_hrs_per_week <- plot_histo_pareto(mean_active_hours$total_active, bin_width = 0.25) +
  labs(title = "Active Hours per Week",
       x = "Hours")
print(viz_active_hrs_per_week)
Preliminary Findings:
Next, I looked for variation in the intensity of exercise between users, by examining the ratio of “Fairly Active” to “Very Active” time for each user:
# For each ID, generate proportion of Very Active time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  mutate(proportion_very_active = very_active / total_active)
# Convert data to long-format for plotting as a histogram
intensity_order = c("fairly_active", "very_active")
mean_daily_intensities_long <- mean_daily_intensities_wide %>%
  tidyr::gather(key = "intensity", value = "mean_hours", -id) %>%
  mutate(intensity = factor(intensity, levels = intensity_order)) %>%
  filter(intensity == "fairly_active" | intensity == "very_active")
# Reorder IDs in order of proportion of Very Active time
mean_daily_intensities_wide <- mean_daily_intensities_wide %>%
  arrange(proportion_very_active)
# Apply order of IDs to long-format data to force plot order
mean_daily_intensities_long$id <- factor(
  mean_daily_intensities_long$id,
  levels = mean_daily_intensities_wide$id)
viz_very_active_proportion <- plot_histo_pareto(mean_daily_intensities_wide$proportion_very_active,
                                                bin_width = 0.1) +
  labs(title = "Percentage of Active Time spent Very Active")
print(viz_very_active_proportion)
Preliminary Findings:
The proportion of Active Time spent Very Active was plotted against the total Active Time to see if there was a correlation. From the earlier analysis of overall Active Hours per week, we can see that Active Hours are skewed right and concentrated towards the lower end: given this distribution, I also re-ran the analysis twice, once with the upper quintile of users removed, and once with the lower quintile removed. The results are plotted below.
plot_all_data <- ggplot(mean_daily_intensities_wide, aes(x = total_active, y = proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label = ..rr.label..),
           label.x = 0,
           label.y = 0) +
  labs(title = "Proportion Very Active vs. Total Active Time",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
data <- mean_daily_intensities_wide %>%
  filter(total_active <= 1.25) # id != "3977333714")
plot_no_upper_quint <- ggplot(data, aes(x = total_active, y = proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label = ..rr.label..),
           label.x = 0,
           label.y = 0) +
  labs(title = "Proportion Very Active vs. Total Active Time (Highest Quintile Removed)",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
data <- mean_daily_intensities_wide %>%
  filter(total_active > 0.25) # id != "3977333714")
plot_no_lower_quint <- ggplot(data, aes(x = total_active, y = proportion_very_active)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(aes(label = ..rr.label..),
           label.x = 0,
           label.y = 0) +
  labs(title = "Proportion Very Active vs. Total Active Time (Lowest Quintile Removed)",
       x = "Average Weekly Active Time (hours)",
       y = "Proportion of Very Active Time (%)")
rm(data)
viz_very_active_vs_total_active <- plot_all_data +
  scale_x_continuous(limits = c(0, 2)) +
  scale_y_continuous(limits = c(0, 1))
print(viz_very_active_vs_total_active)
# plot_no_upper_quint <- plot_no_upper_quint +
#   scale_x_continuous(limits = c(0, 2)) +
#   scale_y_continuous(limits = c(0, 1))
# plot_no_lower_quint <- plot_no_lower_quint +
#   scale_x_continuous(limits = c(0, 2)) +
#   scale_y_continuous(limits = c(0, 1))
#
# grid.arrange(viz_very_active_vs_total_active, plot_no_upper_quint, plot_no_lower_quint, ncol = 3)
Preliminary Findings:

- Very Active Proportion was positively correlated with Active time, with an R^2 value of 0.27.
- Removing the upper quintile decreased the correlation to 0.23, indicating that those doing less exercise also do a wider range of intensities.
- Removing the lower quintile increased the correlation to 0.32, indicating that those doing the most exercise are doing a narrower range of higher-intensity activities.
- Overall, the results indicate that the majority of the cohort is engaged in a wide range of activities. There is a large subset of the group doing a low amount of varied exercise, as well as a small group of outliers doing a higher amount of more-intense exercise.
Why do I care? If people like to run, we can market the Ivy to those people.
How do we find out if people run? Two candidate measures: steps per unit time and distance per unit time. How will we know if they’re running? Speed above a certain level.
I can use the src logs directly: since steps are logged every minute, each data point is also an average speed in steps per minute. From this, I can set up ranges for different movement types based on speed, count the number of step logs in each range for each person-day, then average the sums over the days to get each user’s average time spent at each speed.
brisk_walk_thld <- 80
running_thld <- 110
mean_movement_times_daily <- steps_src_mins_tall %>%
  mutate(activity_day = as.POSIXct(format(as.Date(activity_minute)))) %>%
  group_by(id, activity_day) %>%
  summarize(not_walking = sum(steps == 0) / 60,
            moderate_walking = sum(steps > 0 & steps <= brisk_walk_thld) / 60,
            brisk_walking = sum(steps > brisk_walk_thld & steps <= running_thld) / 60,
            running = sum(steps > running_thld) / 60)
mean_movement_times <- mean_movement_times_daily %>%
  group_by(id) %>%
  summarize(not_walking = mean(not_walking),
            moderate_walking = mean(moderate_walking),
            brisk_walking = mean(brisk_walking),
            running = mean(running)) %>%
  arrange(running)
speed_order = c("not_walking", "moderate_walking", "brisk_walking", "running")
mean_movement_times_long <- mean_movement_times %>%
  tidyr::gather(key = "speed", value = "mean_hours", -id) %>%
  mutate(speed = factor(speed, levels = speed_order))
mean_movement_times_long$id <- factor(
  mean_movement_times_long$id,
  levels = mean_movement_times$id
)
ggplot(mean_movement_times_long %>% filter(speed != "not_walking"),
       aes(x = id, y = mean_hours, fill = speed)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 33, by = 1)) +
  labs(title = "Average Daily Time at each Movement Speed by User ID",
       x = "User ID",
       y = "Average Daily Hours")
viz_mean_mod_walking <- plot_histo_pareto(mean_movement_times$moderate_walking,
                                          bin_width = 0.25) +
  labs(title = "Average Moderate Walking Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_mod_walking)
viz_mean_brisk_walking <- plot_histo_pareto(mean_movement_times$brisk_walking,
                                            bin_width = 0.25) +
  labs(title = "Average Brisk Walking Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_brisk_walking)
viz_mean_running <- plot_histo_pareto(mean_movement_times$running,
                                      bin_width = 0.25) +
  labs(title = "Average Running Hours per Week",
       x = "Average Hours per Week")
print(viz_mean_running)
grid.arrange(viz_mean_mod_walking,
             viz_mean_brisk_walking,
             viz_mean_running,
             ncol = 1)
Preliminary Findings:
viz_mean_mod_walking_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                           mean_movement_times_daily$activity_day,
                                                           mean_movement_times_daily$moderate_walking) +
  labs(title = "Moderate Walking Over Time")
viz_mean_brisk_walking_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                             mean_movement_times_daily$activity_day,
                                                             mean_movement_times_daily$brisk_walking) +
  labs(title = "Brisk Walking Over Time")
viz_mean_running_vs_time <- get_time_coefficients_plot(mean_movement_times_daily$id,
                                                       mean_movement_times_daily$activity_day,
                                                       mean_movement_times_daily$running) +
  labs(title = "Running Over Time")
grid.arrange(viz_mean_mod_walking_vs_time,
             viz_mean_brisk_walking_vs_time,
             viz_mean_running_vs_time,
             nrow = 3)
Preliminary Findings:
mean_daily_steps <- steps_sum_days_tall %>%
  group_by(id) %>%
  summarize(mean_steps = mean(steps_total))
viz_mean_daily_steps <- plot_histo_pareto(mean_daily_steps$mean_steps, 2000) +
  labs(title = "Average Daily Steps by User",
       x = "Average Daily Steps")
print(viz_mean_daily_steps)
Answer: typically around 7,500 steps per day.
viz_mean_daily_steps_vs_time <- get_time_coefficients_plot(steps_sum_days_tall$id,
                                                           steps_sum_days_tall$activity_day,
                                                           steps_sum_days_tall$steps_total) +
  labs(title = "Correlation between Step Counts and Time")
print(viz_mean_daily_steps_vs_time)
Preliminary Findings:
Examples of stationary exercise include gym-based activities like treadmill running and weight-lifting. Examples of mobile exercise include outdoor activities like jogging, but also indoor activities like squash or, depending on the accuracy of the FitBit distance tracking, gym workouts based on rotation through various equipment and exercises spaced around the venue. Determining if users have a preference for one activity type over another can help guide the marketing for the Ivy.
I analysed stationary versus mobile activity preferences by plotting the data for Active Minutes against Active Distance. The theory was that the more stationary exercise users do, the weaker the correlation between active time and active distance should be, because users are not moving while exercising (i.e. machine-based cardio and resistance training).
distance_vs_active <- activity_sum_days_wide %>%
  select(id,
         activity_day,
         distance_moderately_active,
         distance_very_active,
         minutes_fairly_active,
         minutes_very_active) %>%
  mutate(distance_active = distance_moderately_active + distance_very_active,
         minutes_active = minutes_fairly_active + minutes_very_active) %>%
  group_by(id) %>%
  summarize(mean_distance_active = mean(distance_active),
            mean_minutes_active = mean(minutes_active))
viz_distance_vs_activity <- scatter_with_LOBF(distance_vs_active$mean_minutes_active,
                                              distance_vs_active$mean_distance_active,
                                              c("Mean Daily Active Minutes", "Mean Daily Active Distance")) +
  labs(caption = "\"Active\" refers to time spent either Fairly or Very Active") +
  scale_y_continuous(breaks = seq(0, 10, by = 1))
print(viz_distance_vs_activity)
Preliminary Findings:
Overall, based on the strong correlation between Active Time and Active distance, and the scale of the distances covered by active users, it would appear that the cohort engages in a variety of mobile exercise activities. The exact breakdown of mobile vs. stationary activities has not been determined, but I believe it is safe to say that the marketing for the Ivy should represent mobile and outdoor activities more strongly over indoor, gym-based, or other stationary activities.
Subtitle: “Includes Resting Heart Rate, Respiratory Rate, and Cardiac Coherence functions”
Description:
The Readiness Score is one of several features of the Ivy that can be marketed as a value-add over the FitBit devices in the data set. The feature makes use of several functions of the device, some of which the FitBits in the data set do not have, and builds on other features already available, like Respiratory Rate and Resting Heart Rate.
The FitBit devices in the dataset don’t log any direct equivalent to the Readiness Score, nor does the documentation for the devices describe any comparable feature.
While the Blaze does have a breathing-detection function, it does not appear to be tracked over time in the data, and the Blaze manual describes it as being used only for “Guided Breathing” activities, which track respiratory rate only during the activity.
The Surge manual also mentions that [“Wearing your Surge while you sleep has the added benefit of making the Resting Heart Rate measurements on your dashboard more accurate”](https://myhelp.fitbit.com/resource/manual_surge_en_US), so it appears to have that feature as well, but it is not built into an overall “Readiness Score”.
This makes the Readiness Score a potentially valuable function of the Ivy, but it also makes it difficult to investigate the existing data for evidence of how users might make use of this feature.
One aspect we can analyse is whether or not users currently wear their devices to bed.
Nighttime usage of FitBits is particularly relevant to the Readiness Score, as that function does not work unless the user wears their device to bed. If people are already frequently wearing their devices to bed, then the marketing can leverage this tendency to push the feature as an improvement on existing features. On the other hand, if users are not wearing their devices overnight, this may indicate that users are not interested in existing overnight-monitoring functions, or that there are other factors that make overnight wear unappealing, which marketing teams will need to take into account.
As of April 2016, the following wristband FitBits were available:
| Model | Sleep Logging | Heart-rate Tracking | Source |
|---|---|---|---|
| Alta | Yes | No | User Manual |
| Blaze | Yes | Yes | User Manual |
| Charge | Yes | No | User Manual |
| Charge HR | Yes | Yes | User Manual |
| Flex 1 | Yes | No | User Manual |
| Flex 2 | Yes | No | User Manual |
| Surge | Yes | Yes | User Manual |
Preliminary Findings:
For each user, I calculated the number of nights where the total sleep time logged was greater than 6 hours: this value was picked as being close enough to the recommended 8 hours of sleep to indicate the user did actually go to bed with their device on, but not so high that it would exclude users who don’t get a full night’s sleep.
hours_asleep <- 6
total_nights <- 33
nights_logged_by_sleep_log <- sleep_sum_days_wide %>%
  group_by(id) %>%
  summarize(count_logged_asleep = sum(minutes_asleep_total / 60 > hours_asleep),
            count_logged_in_bed = sum(minutes_in_bed_total / 60 > hours_asleep),
            pct_logged_asleep = count_logged_asleep / total_nights,
            pct_logged_in_bed = count_logged_in_bed / total_nights)
plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - Logged Users Only",
       subtitle = "Source: Sleep Log Data")
# Add missing ID values to the dataset to analyse the full cohort
missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_sleep_log$id)
for (id in missing_ids) {
  nights_logged_by_sleep_log <- nights_logged_by_sleep_log %>%
    rbind(., data.frame(
      id = id,
      count_logged_asleep = 0,
      count_logged_in_bed = 0,
      pct_logged_asleep = 0.0,
      pct_logged_in_bed = 0.0))
}
plot_histo_pareto(nights_logged_by_sleep_log$pct_logged_in_bed, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - All Users",
       subtitle = "Source: Sleep Log Data")
Preliminary Findings:
Given that the FitBit devices available at the time the dataset was recorded all have sleep-tracking functionality, these findings indicate that the majority of users are not making use of this function. One possible reason for this is indicated by the manual for the FitBit Surge, which informs the user that “wearing your tracker 24/7 does not allow your skin to breathe”.
The sleep logs data gives a good indication of whether users are interested in sleep logging data, but it doesn’t directly confirm whether people are wearing their devices to bed. To investigate this, I also analysed the heart-rate data for each user. This is by far the largest and least-processed data set, but it has the advantage of only generating data if the device is physically contacting the user.
Taking the brute-force approach to plotting the data for each user makes it much easier to visualise the patterns in user behaviour:
TODO: Widen out this plot
data <- heartrate_src_seconds_tall %>%
  mutate(is_nighttime = hour(heart_rate_second) >= 22 | hour(heart_rate_second) < 6)
p <- ggplot(data) +
  geom_point(aes(x = heart_rate_second,
                 y = heart_rate,
                 color = is_nighttime),
             size = 0.1,
             shape = 3) +
  facet_wrap(vars(id))
print(p)
ggsave("viz_all_heartrate_data.png",
       plot = p,
       device = "png",
       width = 3840,
       height = 2160,
       units = "px")
rm(data)
Once R is done chewing through the data, we can see that all users have clear gaps in their heart-rate logs.
The heart-rate data takes the form of individual polls recording the user’s momentary heart rate. This presents a challenge compared to using the summary data, in that these individual logs must be processed to determine when the user was wearing their device continuously. Counting the number of logs over a given time period will not work, as the interval between logs varies even while the device is worn. Intervals of less than 60 seconds, for example, typically fall between 1 and 20 seconds, as shown below:
heartrate_polling_variance <- heartrate_src_seconds_tall %>%
  group_by(id) %>%
  mutate(polling_diff = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  filter(polling_diff < 60)
ggplot(heartrate_polling_variance) +
  geom_boxplot(aes(y = polling_diff)) +
  scale_y_continuous(breaks = seq(0, 60, by = 5))
This means that even choosing the median value of 5 seconds could result in an inferred “time logged” value that’s out by a factor of 5.
To build out the data, I decided on an approach where the time between logs was used to infer the time logged, but only within an acceptable time difference. If one log follows its predecessor by less than this difference, the time between the two is considered “Logged” time: if the delay is greater, we assume the device was taken off for that time.
From the polling-variance boxplot, we saw that the median, 75th-percentile, and maximum values were 5 seconds, 10 seconds, and 17 seconds, respectively. Given this range, I selected 20 seconds as the maximum acceptable time between polls.
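As a toy illustration of this gap-capping rule (standalone base R, with hypothetical poll times, separate from the actual analysis code):

```r
# Toy illustration: infer "logged" time by summing inter-poll gaps,
# but only where the gap is within the 20-second cap.
polls <- c(0, 4, 9, 14, 300, 305, 311)  # hypothetical poll times in seconds
gaps <- diff(polls)                     # 4 5 5 286 5 6
logged_secs <- sum(gaps[gaps <= 20])    # the 286 s gap is treated as a removal
logged_secs                             # 25
```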
# Find nights where enough heart-rate data was logged to imply the user slept with their FitBit on
sec2hour <- 1 / 3600
# Sleep range set to between 10pm and 6am
sleep_range_stt_hour <- 22
sleep_range_end_hour <- 6
# Anything more than this time between logs is considered a removal (typical poll rate is 1-20 seconds)
max_poll_gap_secs <- 20
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
nights_total <- 33
nights_logged_by_heart_rate <- heartrate_src_seconds_tall %>%
  # For logs in the AM, the night-of is set to the day before
  mutate(night_of_yday = ifelse(hour(heart_rate_second) >= sleep_range_stt_hour,
                                yday(heart_rate_second),
                                ifelse(hour(heart_rate_second) < sleep_range_end_hour,
                                       yday(heart_rate_second) - 1,
                                       NA))) %>%
  # Only consider logs from the "sleep" range
  filter(!is.na(night_of_yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs preceded by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, night_of_yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the nights where each user logged sufficient time to be considered asleep with their FitBit
  group_by(id) %>%
  summarize(nights_logged = sum(logged_time_hours > min_hours_logged),
            nights_logged_pct = nights_logged / nights_total)
# Plot histogram showing who logged what fraction of their nights
plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
  labs(title = "Percentage of Nights Logged In Bed - HR Users Only",
       subtitle = "Source: Heart Rate Data")
#
# # Add missing ID values to the dataset to analyse the full cohort
# missing_ids <- setdiff(unique(activity_sum_days_wide$id), nights_logged_by_heart_rate$id)
# for (id in missing_ids) {
# nights_logged_by_heart_rate <- nights_logged_by_heart_rate %>%
# rbind(.,data.frame(
# id = id,
# nights_logged = 0,
# nights_logged_pct = 0.0))
# }
#
# # Plot histogram showing who logged what fraction of their nights
# plot_histo_pareto(nights_logged_by_heart_rate$nights_logged_pct, bin_width = 0.05) +
# labs(title = "Percentage of Nights Logged In Bed - All Users",
# subtitle = "Source: Heart Rate Data")
Given the small number of heart-rate users, I also compared the heart-rate results directly to the sleep-log results to see whether there was any difference in the nights logged for individual IDs:
sleep_data_merged <- merge(nights_logged_by_sleep_log, nights_logged_by_heart_rate, by = "id", all = TRUE) %>%
  select(id, pct_logged_asleep, nights_logged_pct) %>%
  rename(sleep_logs = pct_logged_asleep,
         heartrate_logs = nights_logged_pct)
sleep_data_merged <- tidyr::gather(sleep_data_merged, key = "Variable", value = "Value", -id)
ggplot(sleep_data_merged, aes(x = id, y = Value, fill = Variable)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.7, color = "black") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1)) +
  labs(title = "Nights Logged: Sleep Logs vs. Heartrate Logs",
       x = "ID",
       y = "% of Nights Logged",
       fill = "Source")
For those IDs present in both data sets, the results were almost identical, with only three users showing variation equating to 2-3 nights, indicating the method is valid.
Preliminary Findings:
From the previous analysis for the Readiness Score feature, we know that sleep tracking is not commonly used, with a majority of 51% (17 users) logging 10% of nights or fewer, and the remainder spread out over a range up to 90% of nights.
sleep_sum_quant_qual <- sleep_src_mins_tall %>%
  # Set the date for sleep logs in the AM to the day before, i.e. the Night Of [date]
  mutate(sleep_date = as.Date(ifelse(hour(sleep_minute) > 12,
                                     as.Date(sleep_minute),
                                     as.Date(sleep_minute) - 1))) %>%
  group_by(id, sleep_date) %>%
  summarize(minutes_awake = sum(sleep_rank == "Awake"),
            minutes_restless = sum(sleep_rank == "Restless"),
            minutes_asleep = sum(sleep_rank == "Asleep"),
            minutes_total = minutes_awake + minutes_restless + minutes_asleep,
            pct_awake = minutes_awake / minutes_total,
            pct_restless = minutes_restless / minutes_total,
            pct_asleep = minutes_asleep / minutes_total)
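A side note on the double `as.Date()` conversion above: base `ifelse()` builds its result from the test vector, so the Date class of the yes/no values is dropped and the underlying day count comes back as a bare number, which is why the outer `as.Date()` is needed. dplyr’s type-stable `if_else()` avoids this. A minimal sketch:

```r
# ifelse() drops the Date class; dplyr::if_else() preserves it
d <- as.Date("2016-04-12")
class(ifelse(TRUE, d, d - 1))           # "numeric": the Date class is lost
class(dplyr::if_else(TRUE, d, d - 1))   # "Date": the class is preserved
```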
sleep_sum_quant_qual_mean <- sleep_sum_quant_qual %>%
  group_by(id) %>%
  summarize(across(.cols = -sleep_date, \(x) mean(x, na.rm = TRUE), .names = "{.col}_mean")) %>%
  mutate(hours_total_mean = minutes_total_mean / 60)
plot_histo_pareto(sleep_sum_quant_qual_mean$hours_total_mean, bin_width = 1) +
  labs(title = "Average Hours Asleep",
       x = "Hours Asleep")
Preliminary Findings:
plot_histo_pareto(sleep_sum_quant_qual_mean$pct_awake_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Awake",
       x = "% of Time")
plot_histo_pareto(sleep_sum_quant_qual_mean$pct_restless_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Restless",
       x = "% of Time")
plot_histo_pareto(sleep_sum_quant_qual_mean$pct_asleep_mean, bin_width = 0.05) +
  labs(title = "Percentage of Sleep Logs Spent Asleep",
       x = "% of Time")
Preliminary Findings:
# get_time_coefficients_plot <- function(ids, timestamps, values)
ggplot(sleep_sum_quant_qual,
       aes(x = sleep_date,
           y = pct_asleep)) +
  geom_point() +
  geom_smooth(method = "lm",
              se = FALSE,
              color = "blue") +
  stat_cor(mapping = aes(label = ..rr.label..),
           method = "pearson",
           label.x = -Inf,
           label.y = Inf,
           hjust = -0.1,
           vjust = 1.1) +
  facet_wrap(vars(id))
# Brute-force Preliminary Findings: Most lines are flat, with some (5-6) that are slightly down
I analysed sleep quantity and quality over time by calculating each user’s coefficient with time for “total time asleep” and “percentage of time asleep”, as measures of quantity and quality respectively:
get_time_coefficients_plot(sleep_sum_quant_qual$id,
                           sleep_sum_quant_qual$sleep_date,
                           sleep_sum_quant_qual$minutes_asleep) +
  labs(title = "Coefficient of Time: Total Time Asleep")
get_time_coefficients_plot(sleep_sum_quant_qual$id,
                           sleep_sum_quant_qual$sleep_date,
                           sleep_sum_quant_qual$pct_asleep) +
  labs(title = "Coefficient of Time: % of Time Asleep")
Preliminary Findings:
Although the FitBits in the data set do have heart rate and breathing rate tracking, there is no data for resting heart rate or cardiac coherence specifically. What I can analyse is the data on heart-rate tracking generally, specifically:
This will help me understand whether or not the RHR and CC functions will appeal to the user base: for instance, users who rarely wear their existing heart-rate monitors, or only wear them during intense exercise, may be less likely to care about their longer-term cardiac performance.
From my previous analysis on heart-rate data, we know that overnight heart-rate tracking is split, with 50% (7 users) logging sleep on 35% to 90% of nights, and the other 50% logging sleep on 0% to 15% of nights.
I can adapt the code from the nighttime analysis to investigate daytime usage:
# Find periods (days or nights) where enough heart-rate data was logged to imply the user wore their FitBit
sec2hour <- 1 / 3600
# Daytime period set to between 6am and 10pm
period_stt_hour <- 6
period_end_hour <- 22
# Typical poll rate is 1-20 seconds: increased to 60 to allow for device moving out of position during daytime movement
max_poll_gap_secs <- 60
# Minimum six hours must be logged to be counted
min_hours_logged <- 6
periods_total <- 33
logging_days <- TRUE
if (logging_days) {
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse((hour(heart_rate_second) >= period_stt_hour & hour(heart_rate_second) < period_end_hour),
                         yday(heart_rate_second),
                         NA))
} else {
  # For night logging, set yday to the day before, i.e. "Night of [yday]"
  periods_logged <- heartrate_src_seconds_tall %>%
    mutate(yday = ifelse(hour(heart_rate_second) >= period_stt_hour,
                         yday(heart_rate_second),
                         ifelse(hour(heart_rate_second) < period_end_hour,
                                yday(heart_rate_second) - 1,
                                NA)))
}
periods_logged <- periods_logged %>%
  # Only consider logs within the period of interest
  filter(!is.na(yday)) %>%
  # Group by id so that difftime only compares timestamps from the same user
  group_by(id) %>%
  mutate(time_since_last_poll = as.double(difftime(heart_rate_second, lag(heart_rate_second), units = "secs"))) %>%
  # Remove any logs preceded by too long a gap
  filter(time_since_last_poll <= max_poll_gap_secs) %>%
  group_by(id, yday) %>%
  # Find the total time logged as the sum of time between valid logs
  summarize(logged_time_hours = sum(time_since_last_poll) * sec2hour) %>%
  # Count up the periods where each user logged sufficient time with their FitBit on
  group_by(id) %>%
  summarize(periods = sum(logged_time_hours > min_hours_logged),
            periods_pct = periods / periods_total)
# Plot histogram showing who logged what fraction of their periods
plot_histo_pareto(periods_logged$periods_pct, bin_width = 0.05) +
  labs(title = "Percentage of Periods Logged",
       subtitle = "Source: Heart Rate Data")
Preliminary Findings:
First the naive approach: What percentage of various users’ time is spent in what heart-rate zone?
# From the Charge HR manual:
# Sedentary = 0-50% Max HR
# Lightly Active = 50-70% Max HR
# Fairly Active = 70-85% Max HR
# Very Active = 85-100% Max HR
# Max HR = 220 - age
# Arbitrarily set median age at 30 for initial analysis
assumed_median_age <- 30
max_heart_rate <- 220 - assumed_median_age
hr_zone_1 <- 0.5 * max_heart_rate
hr_zone_2 <- 0.7 * max_heart_rate
hr_zone_3 <- 0.85 * max_heart_rate
hr_zone_data <- heartrate_src_seconds_tall %>%
  mutate(hr_zone = ifelse(heart_rate >= 0 & heart_rate < hr_zone_1, "sedentary",
                          ifelse(heart_rate >= hr_zone_1 & heart_rate < hr_zone_2, "lightly_active",
                                 ifelse(heart_rate >= hr_zone_2 & heart_rate < hr_zone_3, "fairly_active",
                                        ifelse(heart_rate >= hr_zone_3, "very_active", NA))))) %>%
  group_by(id) %>%
  summarize(sedentary = sum(hr_zone == "sedentary"),
            lightly_active = sum(hr_zone == "lightly_active"),
            fairly_active = sum(hr_zone == "fairly_active"),
            very_active = sum(hr_zone == "very_active"))
# Convert data to long format for plotting as a histogram
hr_zone_order <- c("sedentary", "lightly_active", "fairly_active", "very_active")
hr_zone_data_long <- hr_zone_data %>%
  tidyr::gather(key = "hr_zone", value = "count", -id) %>%
  mutate(hr_zone = factor(hr_zone, levels = hr_zone_order))
ggplot(hr_zone_data_long, aes(x = id, y = count, fill = hr_zone)) +
  geom_bar(stat = "identity",
           position = "fill") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Share of HR Logs by HR Zone",
       x = "User ID",
       y = "Proportion of Logs")
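The nested `ifelse()` ladder used to assign zones can also be written with base R’s `cut()`, which maps values onto labelled bins in one call. A sketch using the same thresholds (the breaks below assume the max heart rate of 190 derived from the assumed median age of 30):

```r
# Sketch: bin heart rates into the four zones with cut() instead of nested ifelse()
assign_hr_zone <- function(hr, max_hr = 190) {
  cut(hr,
      breaks = c(0, 0.5 * max_hr, 0.7 * max_hr, 0.85 * max_hr, Inf),
      labels = c("sedentary", "lightly_active", "fairly_active", "very_active"),
      right = FALSE)  # left-closed bins match the >= / < comparisons above
}
as.character(assign_hr_zone(c(60, 100, 140, 170)))
# "sedentary" "lightly_active" "fairly_active" "very_active"
```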
Preliminary Findings:
Several functions and features of the Ivy are not present on the FitBits in the data set. Given the lack of data, it’s not possible to directly analyse existing user behaviours with respect to these features, which include:
In which I summarise my recommendations
- Most users are not logging sleep
- Those that are appear to already be getting high-quality sleep
- There is nothing to indicate that use of the FitBits improved the quantity or quality of their sleep above this high baseline

- Don’t market this feature as something that will improve your sleep
- Instead, given the consistency with which users get their sleep, as evidenced by the small correlation with time for sleep quantity and quality, Sleep Tracking could be marketed more as a way of tracking and reinforcing the user’s existing good sleep habits
- As with the Readiness Score, marketing teams could frame the feature more as a way of getting a heads-up on days when you haven’t slept well, since most of the time the feature would just tell users (from this cohort, at least) that they had a reasonably good night’s sleep. This would work well with the “track existing habits” marketing angle: think “You’re great at giving your body the sleep it needs, but for those nights where things just don’t work out, Ivy can give you guidance on whether you’re up for the day, or better off taking it down a notch to recover.”
- If it exists, highlight a manual-activation mode for Sleep Tracking. The low amount of non-Asleep data in this data set may be a result of people mostly using the automatic mode, which is convenient but doesn’t track time spent trying to get to sleep: a manual-activation mode could therefore be pushed as a secondary feature that captures better data on how you fare when falling asleep.
# Function Declarations ----
rename_df_variables <- function(df_name, var_mods) {
  cat("Renaming variables in \"", df_name, "\"...\n", sep = "")
  df <- get(df_name)
  # Check each var name requiring correction against the var names in the df
  # (seq_len() safely yields an empty loop if var_mods has no rows)
  for (i in seq_len(nrow(var_mods))) {
    var_old <- var_mods$var_old[i]
    if (!(var_old %in% colnames(df))) {
      next
    }
    # If found, make sure the conversion is applicable to this or all dfs
    tbl <- var_mods$tbl[i]
    if (tbl != "" && tbl != df_name) {
      next
    }
    # Perform the conversion if all checks passed
    var_new <- var_mods$var_new[i]
    cat("df: ", df_name, "\tvar_old: ", var_old, "\t", sep = "")
    cat("var_new: ", var_new, "\t", sep = "")
    cat("tbl: ", tbl, " ", sep = "")
    cat("Replacing... ", sep = "")
    df <- df %>% rename(!!var_new := !!var_old)
    cat("Done.\n", sep = "")
  }
  cat("Renaming variables in \"", df_name, "\" complete.\n", sep = "")
  return(df)
}
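A hypothetical usage sketch of this function (the `demo_df` and `demo_mods` names are invented for illustration; assumes dplyr is loaded, and note that the function looks the data frame up by name via `get()`, so the frame must exist in the calling environment):

```r
# Toy data frame, plus a one-row modification table matching var_mods' layout
demo_df <- data.frame(total_steps = c(1000, 2000), id = c("a", "b"))
demo_mods <- data.frame(var_old = "total_steps", var_new = "steps_total",
                        type_new = "", tbl = "")

# The function fetches "demo_df" by name, applies any matching renames, and
# returns the modified data frame for reassignment
demo_df <- rename_df_variables("demo_df", demo_mods)
colnames(demo_df)  # "steps_total" "id"
```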
# Global Variable Declarations ----
var_mods <- data.frame(
  var_old = character(0),
  var_new = character(0),
  type_new = character(0),
  tbl = character(0)
)
# WARNING: Ensure table-specific modifications (tbl != "") are positioned above non-specific modifications with matching var_old/var_new values: only the first modification in the list will be applied to matching variables.
# TODO: Eliminate this issue by modifying code to warn/handle conflicting rows
var_mods <- var_mods %>%
rbind(.,data.frame(var_old="date", var_new="bodycomp_datetime", type_new="POSIXct", tbl="bodycomp_src_logs_wide")) %>%
rbind(.,data.frame(var_old="time", var_new="heart_rate_second", type_new="POSIXct", tbl="heartrate_src_seconds_tall")) %>%
rbind(.,data.frame(var_old="value", var_new="heart_rate", type_new="", tbl="heartrate_src_seconds_tall")) %>%
rbind(.,data.frame(var_old="date", var_new="sleep_minute", type_new="POSIXct", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="value", var_new="sleep_rank", type_new="", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="", var_new="activity_hour", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="activity_minute", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="sleep_day", type_new="POSIXct", tbl="")) %>%
rbind(.,data.frame(var_old="", var_new="id", type_new="character", tbl="")) %>%
rbind(.,data.frame(var_old="log_id", var_new="sleep_log_id", type_new="character", tbl="sleep_src_mins_tall")) %>%
rbind(.,data.frame(var_old="log_id", var_new="bodycomp_log_id", type_new="character", tbl="bodycomp_src_logs_wide")) %>%
rbind(.,data.frame(var_old="activity_date", var_new="activity_day", type_new="Date", tbl="")) %>%
rbind(.,data.frame(var_old="fairly_active_distance", var_new="distance_fairly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="fairly_active_minutes", var_new="minutes_fairly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="light_active_distance", var_new="distance_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="light_active_minutes", var_new="minutes_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="lightly_active_distance", var_new="distance_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="lightly_active_minutes", var_new="minutes_lightly_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="logged_activities_distance", var_new="distance_logged_activities", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="me_ts", var_new="mets", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="moderately_active_distance", var_new="distance_moderately_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_active_distance", var_new="distance_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_distance", var_new="distance_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_active_minutes", var_new="minutes_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="sedentary_minutes", var_new="minutes_sedentary", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="step_total", var_new="steps_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_distance", var_new="distance_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_intensity", var_new="intensity_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_minutes_asleep", var_new="minutes_asleep_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_sleep_records", var_new="sleep_records_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_steps", var_new="steps_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="total_time_in_bed", var_new="minutes_in_bed_total", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="tracker_distance", var_new="distance_tracker", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="very_active_distance", var_new="distance_very_active", type_new="", tbl="")) %>%
rbind(.,data.frame(var_old="very_active_minutes", var_new="minutes_very_active", type_new="", tbl=""))
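One way to surface the conflicting rows flagged in the TODO above — `var_old` values that appear in more than one modification row, where only the first match would ever be applied — is a quick `duplicated()` check. A sketch in base R:

```r
# Flag var_old values listed more than once (ignoring the blank placeholder
# rows), since rename_df_variables() only ever applies the first matching row
dup_olds <- unique(var_mods$var_old[duplicated(var_mods$var_old) & var_mods$var_old != ""])
if (length(dup_olds) > 0) {
  warning("var_mods has multiple rows for: ", paste(dup_olds, collapse = ", "))
}
```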